A Parallel and Distributed Approach for Finding Transitive Closures of Data Records: A Proposal

Authors

  • Roopa Bheemavaram
  • Ning Li
Abstract

In this paper, we propose an approach for finding transitive closures on large data sets in a distributed (i.e., parallel) environment. Finding transitive closures of data records is the preprocessing step of a two-step approach to data quality control, which covers aspects such as data accuracy, redundancy, consistency, currency, and completeness. The objective of finding transitive closures is to reduce the number of records that must be considered in the second step, from a whole data source holding hundreds of millions to billions of records down to the range of hundreds to thousands. Processing hundreds of millions to billions of records requires an efficient approach that works in a distributed environment. As part of this approach, this paper presents an efficient distributed algorithm for solving the transitive closure problem on large data sets. Because of huge data volumes, many real-world applications need fast and efficient approaches to data analysis and data mining. Data cleansing, which precedes data analysis and data mining, has become a center of research interest in recent years. Within data cleansing, finding transitive closures (i.e., finding all related records) is the main goal of this paper. Computing the transitive closures of data records involves two related but independent activities. One is to determine, for virtually all pairs of records, whether two records are related under some definition of relatedness. The other is to identify the transitive closures based on these relations. This paper considers a particular definition of a relation between two records, and the proposed parallel and distributed approach takes advantage of this definition to improve the performance of transitive closure computation in a grid computing environment.
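The two activities named in the abstract (testing pairs for relatedness, then grouping records into transitive closures) can be illustrated with a minimal single-machine sketch. This is not the authors' distributed algorithm, which the abstract does not specify. The `related` function is a hypothetical stand-in for the paper's definition of relatedness (here: two records share at least one non-empty field value), and the grouping step uses a standard union-find structure. The all-pairs loop is quadratic and is shown only to make the two steps concrete.

```python
from collections import defaultdict
from itertools import combinations


def related(a, b):
    """Hypothetical relatedness test (an assumption, not the paper's definition):
    two records are related if they share at least one non-empty field value."""
    return any(v and v == b.get(k) for k, v in a.items())


def transitive_closures(records):
    """Group record indices so that any chain of related records ends up together."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Attach the larger root index under the smaller one.
            parent[max(ri, rj)] = min(ri, rj)

    # Activity 1: decide relatedness for every pair of records.
    for i, j in combinations(range(len(records)), 2):
        if related(records[i], records[j]):
            union(i, j)

    # Activity 2: collect the closures (connected groups) from the roots.
    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())
```

In this sketch, records 0 and 2 would land in the same closure if record 0 is related to record 1 and record 1 is related to record 2, even when 0 and 2 share no field directly. That chaining is what lets the second step of quality control look only at small groups instead of the whole data source.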

Similar resources

Static Task Allocation in Distributed Systems Using Parallel Genetic Algorithm

Over the past two decades, PC speeds have increased from a few instructions per second to several million instructions per second. The tremendous speed of today's networks as well as the increasing need for high-performance systems has made researchers interested in parallel and distributed computing. The rapid growth of distributed systems has led to a variety of problems. Task allocation is a...

Fast Acceleration of Ultimately Periodic Relations

Computing transitive closures of integer relations is the key to finding precise invariants of integer programs. In this paper, we describe an efficient algorithm for computing the transitive closures of difference bounds, octagonal and finite monoid affine relations. On the theoretical side, this framework provides a common solution to the acceleration problem, for all these three classes of r...

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as a linked or not-linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

PTIME Computation of Transitive Closures of Octagonal Relations

Computing transitive closures of integer relations is the key to finding precise invariants of integer programs. In this paper, we study difference bounds and octagonal relations and prove that their transitive closure is a PTIME-computable formula in the existential fragment of Presburger arithmetic. This result marks a significant complexity improvement, as the known algorithms have EXPTIME wo...

Separating indexes from data: a distributed scheme for secure database outsourcing

Database outsourcing is an idea to eliminate the burden of database management from organizations. Since data is a critical asset of organizations, preserving its privacy from outside adversary and untrusted server should be warranted. In this paper, we present a distributed scheme based on storing shares of data on different servers and separating indexes from data on a distinct server. Shamir...

Publication date: 2006